
test: add e2e test for NVIDIA device plugin as DaemonSet #7964

Merged

ganeshkumarashok merged 5 commits into main from
aganeshkumar/nvidia-device-plugin-daemonset-e2e on Mar 3, 2026

Conversation

@ganeshkumarashok
Contributor

Summary

  • Add a new e2e test that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service
  • This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment

Test Details

The test Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset:

  • Provisions a GPU node (Standard_NV6ads_A10_v5) with GPU drivers enabled but systemd device plugin disabled
  • Deploys nvidia-device-plugin:v0.18.2 from MCR (mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin) as a DaemonSet
  • Validates:
    • GPU resources are advertised by the device plugin
    • GPU workloads can be scheduled on the node

Test plan

  • CI pipeline runs the new GPU e2e test
  • Verify the DaemonSet-based device plugin properly registers GPU resources
  • Verify GPU workloads can be scheduled

Copilot AI left a comment

Pull request overview

This pull request adds a new e2e test (Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset) that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service. This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment.

Changes:

  • Adds a new GPU e2e test that provisions a Standard_NV6ads_A10_v5 GPU node with drivers enabled but systemd device plugin disabled
  • Deploys nvidia-device-plugin:v0.18.2 from MCR as a DaemonSet
  • Validates GPU resource advertisement and workload scheduling with DaemonSet-based device plugin

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.

Comment on lines +181 to +189

```go
if pod.Status.Phase == corev1.PodRunning {
	// Check if all containers are ready
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if !containerStatus.Ready {
			s.T.Logf("Container %s is not ready yet", containerStatus.Name)
			return false, nil
		}
	}
	return true, nil
```

Copilot AI Feb 26, 2026
The wait loop checks if any containers in the pod are not ready, but it only logs which container is not ready without failing immediately on errors like CrashLoopBackOff or ImagePullBackOff. If the pod enters a failed state, the test will wait the full 3 minutes before timing out.

Consider checking pod.Status.ContainerStatuses[].State.Waiting.Reason for error states (ImagePullBackOff, CrashLoopBackOff, etc.) and returning an error immediately rather than continuing to poll. This would make test failures faster and provide clearer error messages.

Comment on lines +117 to +120

```go
AllowPrivilegeEscalation: to.Ptr(false),
Capabilities: &corev1.Capabilities{
	Drop: []corev1.Capability{"ALL"},
},
```

Copilot AI Feb 26, 2026

The SecurityContext configuration may prevent the device plugin from functioning correctly. The NVIDIA device plugin needs to interact with GPU devices and the kubelet device plugin socket, which typically requires either:

  1. Privileged mode (Privileged: true), OR
  2. Specific capabilities like SYS_ADMIN

The current configuration only drops all capabilities without granting the necessary ones. The upstream NVIDIA device plugin DaemonSet spec typically does NOT use AllowPrivilegeEscalation: false with dropped capabilities - it usually runs in privileged mode or with specific capabilities.

Without proper privileges, the device plugin won't be able to enumerate GPUs or create device plugin sockets. This could cause the test to fail when it tries to validate GPU resources are advertised. Verify this configuration works, or update it to match the upstream privileged configuration.

Suggested change

```diff
-AllowPrivilegeEscalation: to.Ptr(false),
-Capabilities: &corev1.Capabilities{
-	Drop: []corev1.Capability{"ALL"},
-},
+Privileged: to.Ptr(true),
```

```go
Containers: []corev1.Container{
	{
		Name:  "nvidia-device-plugin-ctr",
		Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
```

Copilot AI Feb 26, 2026

The image tag v0.18.2 does not match the version format used in parts/common/components.json (0.18.2-ubuntu22.04u1). While MCR may support the simplified v0.18.2 tag, it's better to use the exact version that aligns with what's cached on VHDs and referenced in components.json for consistency. This ensures the test validates the same version that production nodes would use.

The components.json shows nvidia-device-plugin versions like "0.18.2-ubuntu22.04u1" for Ubuntu 22.04. Consider using the full version tag or documenting why the simplified tag is intentional.

Copilot generated this review using guidance from repository custom instructions.
```go
ds := nvidiaDevicePluginDaemonset(s.Runtime.VM.KubeName)
err := s.Runtime.Cluster.Kube.CreateDaemonset(ctx, ds)
require.NoError(s.T, err, "failed to create NVIDIA device plugin DaemonSet")
```

Copilot AI Feb 26, 2026
The DaemonSet is created but never explicitly cleaned up. While Kubernetes may garbage collect when the test cluster is torn down, it's better practice to add explicit cleanup to prevent resource leaks if the test fails before cluster cleanup. E2E tests should be self-contained and clean up resources they create.

Consider using s.T.Cleanup() or defer to delete the DaemonSet at the end of the validator function, similar to how other e2e tests handle resource cleanup.

Suggested change

```go
// Ensure the DaemonSet is cleaned up after the test finishes
s.T.Cleanup(func() {
	s.T.Logf("Cleaning up NVIDIA device plugin DaemonSet %s/%s...", ds.Namespace, ds.Name)
	if err := s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
		context.Background(),
		ds.Name,
		metav1.DeleteOptions{},
	); err != nil {
		s.T.Logf("Failed to delete NVIDIA device plugin DaemonSet %s/%s: %v", ds.Namespace, ds.Name, err)
	}
})
```

Comment on lines +106 to +121

```go
Containers: []corev1.Container{
	{
		Name:  "nvidia-device-plugin-ctr",
		Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
		Env: []corev1.EnvVar{
			{
				Name:  "FAIL_ON_INIT_ERROR",
				Value: "false",
			},
		},
		SecurityContext: &corev1.SecurityContext{
			AllowPrivilegeEscalation: to.Ptr(false),
			Capabilities: &corev1.Capabilities{
				Drop: []corev1.Capability{"ALL"},
			},
		},
```

Copilot AI Feb 26, 2026
The DaemonSet container spec is missing several configuration options present in the upstream NVIDIA device plugin deployment (https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin.yml). Specifically:

  1. Missing the /dev/null device volume mount which the device plugin may need
  2. The upstream spec typically includes additional environment variables like NVIDIA_MIG_MONITOR_DEVICES, NVIDIA_DRIVER_CAPABILITIES, etc.
  3. The upstream spec sets Privileged: true in SecurityContext rather than just dropping capabilities

While the minimal spec may work for basic testing, using a configuration closer to the upstream deployment would make this test more valuable as it validates the real-world customer scenario. Consider either aligning more closely with the upstream spec or adding a comment explaining why this simplified configuration is sufficient for the test's purpose.

Comment on lines +51 to +57

```go
waitForNvidiaDevicePluginDaemonsetReady(ctx, s)

// Validate that GPU resources are advertised by the device plugin
ValidateNodeAdvertisesGPUResources(ctx, s, 1, "nvidia.com/gpu")

// Validate that GPU workloads can be scheduled
ValidateGPUWorkloadSchedulable(ctx, s, 1)
```

Copilot AI Mar 3, 2026
After waiting for the device plugin pod to be ready, consider adding validation to check the pod logs for successful GPU discovery and device plugin registration. This would catch issues where the pod starts but the device plugin fails to function correctly (e.g., driver compatibility issues, incorrect configuration).

The logs should contain messages about discovering GPUs and registering with kubelet's device plugin framework. This would make the test more comprehensive and catch edge cases where the pod runs but doesn't actually register GPU resources properly.

@surajssd (Member) left a comment

nvm

Add a new e2e test that validates GPU nodes work correctly when the
NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of
a systemd service. This tests the upstream deployment model commonly
used by customers who manage their own device plugin deployment.

The test:
- Provisions a GPU node with drivers but without systemd device plugin
- Deploys nvidia-device-plugin v0.18.2 as a DaemonSet from MCR
- Validates GPU resources are advertised and workloads can be scheduled
ganeshkumarashok and others added 4 commits March 3, 2026 10:41
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Use unique DaemonSet name per node to avoid collisions in shared cluster
- Add cleanup to delete DaemonSet when test finishes
- Use Privileged mode matching upstream NVIDIA device plugin spec
- Use existing WaitUntilPodRunning helper instead of custom wait loop
- Add comments explaining image version choice
- Extract image version to constant for easier updates
- Add validation that systemd device plugin is not running
- Truncate DaemonSet name to 63 chars (K8s limit)
- Add timeout contexts to cleanup operations
- Delete existing DaemonSet before create for idempotency
Copilot AI review requested due to automatic review settings March 3, 2026 18:42
@ganeshkumarashok force-pushed the aganeshkumar/nvidia-device-plugin-daemonset-e2e branch from 54fe957 to 2eae10e on March 3, 2026 18:42
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment on lines +85 to +86

```go
// Using "is-active" which returns non-zero if not active
result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null || echo 'not-running'")
```

Copilot AI Mar 3, 2026

systemctl is-active ... || echo 'not-running' can produce multi-line stdout (e.g., inactive plus not-running) because is-active prints a status string even when it exits non-zero. This makes the log/output ambiguous and can mask unexpected states; consider using systemctl is-active --quiet ... (and then echo a single status) or otherwise ensure only one deterministic status string is captured.

Suggested change

```diff
-// Using "is-active" which returns non-zero if not active
-result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null || echo 'not-running'")
+// Using "is-active --quiet" and echoing a single deterministic status string
+result := execScriptOnVMForScenario(ctx, s, "if systemctl is-active --quiet nvidia-device-plugin.service 2>/dev/null; then echo 'active'; else echo 'not-running'; fi")
```

Comment on lines +98 to +105

```go
func nvidiaDevicePluginDaemonsetName(nodeName string) string {
	prefix := "nvdp-" // Short prefix to leave room for node name
	maxLen := 63
	name := prefix + nodeName
	if len(name) > maxLen {
		name = name[:maxLen]
	}
	return name
```

Copilot AI Mar 3, 2026

nvidiaDevicePluginDaemonsetName truncates to 63 chars but doesn’t ensure the result is a valid DNS-1123 label (must end with an alphanumeric). If truncation lands on -, DaemonSet creation/label matching will fail. Consider trimming trailing - after truncation (similar to truncatePodName in validators.go) and, to avoid collisions across long node names, appending a short hash suffix derived from the full node name.

Comment on lines +203 to +213

```go
deleteCtx, deleteCancel := context.WithTimeout(ctx, 30*time.Second)
defer deleteCancel()
_ = s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
	deleteCtx,
	ds.Name,
	metav1.DeleteOptions{},
)

// Create the DaemonSet
err := s.Runtime.Cluster.Kube.CreateDaemonset(ctx, ds)
require.NoError(s.T, err, "failed to create NVIDIA device plugin DaemonSet")
```

Copilot AI Mar 3, 2026

The pre-delete of the DaemonSet is asynchronous; immediately calling create with the same name can still fail with AlreadyExists while the old DaemonSet is terminating. To avoid flaky reruns after a failed/aborted test, consider waiting for the DaemonSet to be fully deleted (poll for NotFound) or switching to an idempotent CreateOrUpdate that actually updates the existing DaemonSet spec.

@ganeshkumarashok merged commit ca8a90e into main on Mar 3, 2026
21 of 29 checks passed
@ganeshkumarashok deleted the aganeshkumar/nvidia-device-plugin-daemonset-e2e branch on March 3, 2026 18:47